

β–Ί Chapter 12: Introduction to nonlinear learning

12.5 Universal approximation

In Sections 12.1 - 12.3 we saw how supervised and unsupervised learners alike can be extended to perform nonlinear learning via arbitrary linear combinations of nonlinear functions / feature transformations. However, the general problem of engineering an appropriate nonlinearity for a given dataset has thus far remained elusive. In this Section we introduce the first of two major tools for handling this task: universal approximators. Universal approximators are families of simple nonlinear feature transformations whose members can be combined to create arbitrarily complex nonlinearities like any we would ever expect to find in a supervised or unsupervised learning dataset. Here we will also introduce the three standard types of universal approximators employed in practice today - kernels, neural networks, and trees - of which we will have much more to say in future Chapters.

InΒ [1]:
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
%matplotlib notebook

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

12.5.1 Universal approximationΒΆ


One can think of a universal approximator the way we think about the periodic table of atomic elements, with any physical substance reducible to a combination of its members: by themselves the members of a universal approximator are simple, but when combined appropriately they can create any nonlinear feature transformation desired (a property also called universal approximation).


What we will see here is that any nonlinearity $h\left(\mathbf{x}\right)$ - whether it be an appropriate nonlinear model for a supervised learning dataset, or a nonlinear encoder / decoder for unsupervised data - can be approximated by a linear combination of $B$ universal approximators $f_1,\,f_2,\,...\,,f_B$ as

\begin{equation} h\left(\mathbf{x}\right) \approx w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right) \end{equation}

where each of the nonlinear functions / universal approximators on the right hand side can have internal parameters.
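As a rough sketch of this idea in code - with hypothetical unit functions and hand-picked weights standing in for any particular catalog of approximators - such a linear combination can be evaluated as:

```python
import numpy as np

def combination(x, w0, weights, units):
    """Evaluate w0 + sum_b w_b * f_b(x) for a list of unit functions."""
    return w0 + sum(w * f(x) for w, f in zip(weights, units))

# hypothetical units standing in for members of some universal approximator
units = [np.sin, np.cos, lambda x: x**2]
x = np.linspace(-1, 1, 5)
h = combination(x, 0.5, [1.0, -2.0, 3.0], units)
print(h.shape)  # (5,)
```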

12.5.2 The perfect kind of data for learning¶

In a perfect world our desired approximation $\approx$ attains strict equality $=$, i.e., $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$ for every point. In such a perfect instance the dataset lies precisely on a line (or more generally, a hyperplane). The most perfect dataset we could possibly have for linear regression is then - ignoring computation - a line (or hyperplane) itself (as illustrated below in the animated gif).

In complete analogy to the linear case, here our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p, \mathbf{w}\right) = y_p$, i.e., points lying perfectly on the nonlinear curve (or in general the nonlinear surface) defined by $f$. The most perfect dataset we could possibly have for nonlinear regression is then - ignoring computation - a curve (or surface) itself (as illustrated below in the animated gif).

Figure 1: The first gif starts by showing a realistic linear regression dataset (a small noisy set of points that can be roughly modeled by a line), then removes all noise from those points (making them all lie perfectly on some line), and finally shows the perfect linear regression scenario where we have infinitely many points lying perfectly on a line. The second gif does the same for nonlinear regression: a realistic nonlinear dataset (a small noisy set of points that can be roughly modeled by a nonlinear curve), the same points with all noise removed (lying perfectly on some curve), and finally the perfect nonlinear regression scenario with infinitely many points lying perfectly on the curve itself.

Examining our setup, here for linear two-class classification our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$, i.e., points lying perfectly on the step function with linear boundary given by $\text{model}\left(\mathbf{x},\mathbf{w}\right) = 0$. The most perfect dataset we could possibly have for linear classification is then - ignoring computation - a perfect step function (with linear boundary) itself. This is illustrated below in the animated gif.

For nonlinear two-class classification our perfect dataset would consist of points where the nonlinear decision boundary given by $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = 0$ perfectly separates the data. In other words, our points lie perfectly on the step function with this nonlinear boundary. The most perfect dataset we could possibly have for nonlinear classification is then - ignoring computation - a perfect step function (with nonlinear boundary) itself. This is illustrated below in the animated gif.

Figure 2: The first gif starts by showing a realistic linear classification dataset (a small noisy set of points that can be roughly modeled by a step function), then removes all noise from those points (i.e., returning the true label values to the 'noisy' points), and finally shows the perfect linear classification scenario where we have infinitely many points lying perfectly on a step function with linear boundary. The second gif does the same for nonlinear classification, ending with the perfect scenario of infinitely many points lying perfectly on a step function with nonlinear boundary.

12.5.3 Combinations of random nonlinear functions¶

Before jumping into a discussion of organized families of nonlinear functions, here we illustrate how adding together unrelated functions can fail to provide the sort of fine-grained nonlinear modeling we are after. We will explore fitting combinations of three nonlinear functions to the noisy sinusoidal dataset shown in the next Python cell.

InΒ [8]:
# create instance of linear regression demo, used below and in the next examples
csvname =  datapath + 'noisy_sin_sample.csv'
demo1 = nonlib.demos_part_2.Visualizer(csvname)
demo1.show_pts()

While we saw in the previous Section that we could determine an appropriate nonlinearity, here suppose this is the sort of dataset whose nonlinearity we will have to find by searching over combinations of nonlinear features.

First, suppose we try out various combinations of the following three functions

\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = \text{sinc}(10x + 1) \end{equation}

which we plot in the next Python cell.

InΒ [10]:
# plot our features
demo1.plot_feats(version = 1)

We will try combinations of these three features in succession. That is, we will first try the first feature alone, with a predict function that is simply a weighted sum involving $f_1$

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1f_1(x). \end{equation}

Remember here the notation $\omega$ denotes the entire set of weights used (here just $w_0$ and $w_1$). We fit this model to the given dataset by minimizing the corresponding Least Squares cost.

Next we will try a weighted combination of the first two elements $f_1$ and $f_2$

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) \end{equation}

tuning these weights properly by fitting to the dataset.

Finally we try a weighted combination of all three features as

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) + w_3\,f_3(x) \end{equation}

tuning the weights by fitting to the dataset.
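This weight tuning can be sketched as follows - here with synthetic noisy data standing in for the dataset loaded above, and with NumPy's normalized `sinc` convention, which may differ from the one intended in the text:

```python
import numpy as np

# synthetic noisy data standing in for the dataset used in this Section
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)

# feature matrix for predict(x, w) = w0 + w1*f1(x) + w2*f2(x) + w3*f3(x)
# note: np.sinc(z) = sin(pi z) / (pi z), one common sinc convention
feats = [lambda z: z, lambda z: z**2, lambda z: np.sinc(10 * z + 1)]
F = np.column_stack([np.ones_like(x)] + [f(x) for f in feats])

# tune the weights by minimizing the Least Squares cost (closed form)
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print(w.shape)  # (4,)
```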

With the weights of all three models tuned, let's plot the resulting fits. In the left, middle, and right panels below we plot the dataset and the resulting fit from the first, second, and third models respectively.

InΒ [11]:
# plots showing the first (left), second (middle), and third (right) model predictions
demo1.show_fits(version = 1)

None of these predictions is adequate. With the first two we have failed to fully capture the nonlinear nature of the data, while with the third we have greatly overestimated this nonlinearity.

Moreover, while the nonlinearity of the first two predictions changes fairly gradually, the leap in nonlinearity from the second to the third prediction is substantial. This is because the first two features - monomials of two successive degrees - are closely related, while the third feature is drastically different from the first two. With such drastically different features we are unable to dial in the sort of nonlinearity we want.

If we swap out the third sinc feature for the next degree three monomial term - giving the three features

\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = x^3 \end{equation}

we not only gain significantly more fine-grained control over how much nonlinearity we introduce into our prediction, but can also produce a much better result.

We plot all three of these features in the next Python cell.

InΒ [12]:
# plot our features
demo1.plot_feats(version = 2)

And now - just as before - we use the first of these features alone, then a linear combination of the first two, and finally a combination of all three to make predictions. In the next cell we plot the first, second, and third predictions in the left, middle, and right panels respectively.

InΒ [13]:
# show first round of fits
demo1.show_fits(version = 2)

By using three related features - as opposed to including one odd-ball nonlinearity - we get a fantastic fit when using all three features this time. Moreover the change in nonlinearity is more gradual, predictable, and controllable across the three predictions.

This example highlights an important point: we need to be organized and thoughtful in our search for the right combination of features. Simply taking random combinations of various nonlinear functions as a predictor e.g., like

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\text{tanh}\left(w_2 + w_3x\right) + w_4e^{w_5x} + w_6x^{10} + \cdots \end{equation}

does not generally allow for fine-tuning of the nonlinearity in a prediction (nor efficient computation, as we will see). Conversely, if we restrict ourselves to a set of related functions we can better manage the amount of nonlinearity in our models.

12.5.4 The kernel catalog of functions¶

The catalog of kernel functions consists of groups of functions with no internal parameters, a primary example being polynomials. When dealing with just one input this sub-family of functions looks like

\begin{equation} f_1(x) = x, ~~ f_2(x) = x^2,~~ f_3(x)=x^3,... \end{equation}

and so forth with the $D^{th}$ element taking the form $f_D(x) = x^D$. A combination of the first $D$ units from this sub-family is often referred to as a degree $D$ polynomial. There are an infinite number of these functions - one for each positive whole number - and they are naturally ordered by their degree (the higher the degree, the further down the list that function is). Moreover each element has a fixed shape - e.g., the monomial $f_2(x) = x^2$ is always a parabola, and looks like a cup facing upwards. In the next Python cell we plot the first four elements (also called units) of the polynomial sub-family of kernel functions.
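Building these units in code is straightforward - the helper below is a minimal sketch of our own, not part of the library used in this notebook:

```python
import numpy as np

def poly_features(x, D):
    """Stack the first D polynomial units x, x^2, ..., x^D as columns."""
    return np.column_stack([x**d for d in range(1, D + 1)])

x = np.array([1.0, 2.0, 3.0])
F = poly_features(x, 3)
print(F[1])  # [2. 4. 8.]
```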

InΒ [2]:
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()

# plot the first 4 elements of the polynomial basis
demo.show_1d_poly()

In two inputs $x_1$ and $x_2$, polynomial units take the analogous form

\begin{equation} f_1(x_1,x_2) = x_1, ~~ f_2(x_1,x_2) = x_2^2, ~~ f_3(x_1,x_2) = x_1x_2^3, ~~ f_4(x_1,x_2) = x_1^4x_2^6,... \end{equation}

with a general degree $D$ unit taking the form

\begin{equation} f_m(x_1,x_2) = x_1^px_2^q \end{equation}

where $p$ and $q$ are nonnegative integers and $p + q \leq D$. A degree $D$ polynomial in this case is a linear combination of all such units. This sort of definition generalizes to polynomial units in general $N$ dimensional input as well.
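Enumerating these units for a given degree $D$ can be sketched as below - the helper name is a hypothetical illustration, not part of the library used here:

```python
from itertools import product

def poly_exponents_2d(D):
    """All exponent pairs (p, q) with 1 <= p + q <= D, one per unit x1^p * x2^q."""
    return [(p, q) for p, q in product(range(D + 1), repeat=2) if 1 <= p + q <= D]

# degree 2 in two inputs gives 5 units: x1, x2, x1^2, x1*x2, x2^2
print(len(poly_exponents_2d(2)))  # 5
```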

In the Python cell below we draw a sampling of these polynomial units.

InΒ [3]:
# draw a few polynomial units in two inputs
demo.show_2d_poly()

Example 1: Various predictions using polynomial unitsΒΆ

In this example we animate the fitting of models built from $B$ polynomial units, which generally take the form

\begin{equation} \text{predict}\left(\mathbf{x}, \omega \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right) \end{equation}

to a variety of simple datasets. Notice in each case how - because these features come from the same sub-family of related nonlinear functions - the corresponding predictions change fairly gradually as additional units are added to the model (in other words, we can fine-tune the amount of nonlinearity we want in our prediction).

First, in the next cell we animate the final trained predictions when fitting $B = 1$ through $B = 4$ polynomial units to the noisy sinusoid dataset shown in the previous Subsection. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data in the left panel. In the right panel we simultaneously show the Least Squares cost function value of each trained model.

InΒ [4]:
demo = nonlib.regression_basis_single.Visualizer()
csvname =  datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,5)])



Next, in the following Python cell we animate the final trained predictions when fitting $B = 1$ through $B = 10$ polynomial units to a three-dimensional noisy sinusoid dataset. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data. We do not plot the corresponding Least Squares cost function value, but as in the above example it decreases as we add more polynomial units to the model.

InΒ [6]:
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname =  datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_units =  [v for v in range(1,10)] ,view = [20,110],basis = 'poly')



And finally, an analogous example with two-class classification. As you move the slider from left to right additional polynomial units are added to the model. In this case each notch of the slider adds multiple units, since we are increasing the degree $D$ of the total polynomial (which, for two inputs, means adding several individual polynomial units each time $D$ is increased).

InΒ [9]:
# load in dataset  http://math.arizona.edu/~dsl/
csvname = datapath + '2_eggs.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)

# run animator
demo.brows_single_fits(num_units =  [v for v in range(0,4)], basis = 'poly',view = [30,-80])



  • General attributes of kernel bases:
    • classic examples also include: Fourier and cosine bases, radial basis functions
    • each element has a fixed shape (no internal parameters), elements are ordered from simple to complex
    • always leads to convex cost functions for regression / classification
    • great for medium-sized problems, but many varieties have serious scaling issues when $N$ and $P$ are large
    • often used with support vector machines (for historical reasons)


12.5.5 Neural networks¶

  • Basic example: single hidden layer network with tanh activation
\begin{equation} f_1(x\,,\theta_1) = \text{tanh}\left(w_{1,0} + w_{1,1}x\right), ~~ f_2(x\,,\theta_2) = \text{tanh}\left(w_{2,0} + w_{2,1}x\right), ~~ f_3(x\,,\theta_3) = \text{tanh}\left(w_{3,0} + w_{3,1}x\right), ~~ f_4(x\,,\theta_4) = \text{tanh}\left(w_{4,0} + w_{4,1}x\right), ... \end{equation}
  • Each element has internal parameters, here $\theta_j = \{w_{j,0},w_{j,1}\}$ are the internal parameters for $f_j$
  • Internal parameters make each function flexible
  • e.g., take $f_1$ with various internal parameter settings
InΒ [70]:
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()

# plot the first 4 elements of a single-layer tanh network basis
demo.show_1d_net(num_layers = 1,activation = 'tanh')
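A single such unit can be sketched directly, with internal parameters chosen arbitrarily for illustration:

```python
import numpy as np

def tanh_unit(x, w0, w1):
    """A single-hidden-layer unit f(x, theta) = tanh(w0 + w1 * x)."""
    return np.tanh(w0 + w1 * x)

x = np.linspace(-3, 3, 7)
# the same unit takes on different shapes as its internal parameters change
curves = [tanh_unit(x, w0, w1) for w0, w1 in [(0, 1), (0, 5), (2, -1)]]
print(len(curves))  # 3
```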
  • Basic example: single hidden layer network with relu activation
\begin{equation} f_1(x\,,\theta_1) = \text{max}\left(0,w_{1,0} + w_{1,1}x\right), ~~ f_2(x\,,\theta_2) = \text{max}\left(0,w_{2,0} + w_{2,1}x\right), ~~ f_3(x\,,\theta_3) = \text{max}\left(0,w_{3,0} + w_{3,1}x\right), ~~ f_4(x\,,\theta_4) = \text{max}\left(0,w_{4,0} + w_{4,1}x\right), ... \end{equation}
  • Internal parameters make each function flexible
InΒ [79]:
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()

# plot the first 4 elements of a single-layer relu network basis
demo.show_1d_net(num_layers = 1,activation = 'relu')
InΒ [76]:
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tanh',num_elements = [v for v in range(1,5)])



  • For two inputs - $x_1$ and $x_2$ - the catalog of single hidden layer functions with tanh activation looks like
\begin{equation} f_1(x\,,\theta_1) = \text{tanh}\left(w_{1,0} + w_{1,1}x_1 + w_{1,2}x_2\right), ~~ f_2(x\,,\theta_2) = \text{tanh}\left(w_{2,0} + w_{2,1}x_1 + w_{2,2}x_2\right) ... \end{equation}
  • or likewise
\begin{equation} f_1(x\,,\theta_1) = \text{max}\left(0,w_{1,0} + w_{1,1}x_1 + w_{1,2}x_2\right), ~~ f_2(x\,,\theta_2) = \text{max}\left(0,w_{2,0} + w_{2,1}x_1 + w_{2,2}x_2\right) ... \end{equation}
InΒ [46]:
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements =  [v for v in range(1,12)] ,view = [20,110],basis = 'net')



  • Generalizes to $N$ dimensional input
  • General attributes of neural network bases:
    • elements are of the same general type, each has adjustable shape (with internal parameters)
    • leads to non-convex cost functions for regression / classification
    • especially great for large high dimensional input problems
    • often used with logistic regression (for historical reasons)

12.5.6 Trees¶

  • Basic example: depth-1 tree, also known as a stump
\begin{equation} f_1(x) = \begin{cases} a_1 & x < V_1 \\ b_1 & x \geq V_1 \end{cases} ~~~~~~~~ f_2(x) = \begin{cases} a_2 & x < V_2 \\ b_2 & x \geq V_2 \end{cases} ~~ \cdots \end{equation}
  • Vocab: for $f_j$ the value $V_j$ is called a split point, and $a_j$ and $b_j$ are called levels
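A stump can be sketched directly from this definition (the function and argument names are our own):

```python
def stump(x, split, left_level, right_level):
    """A depth-1 tree: returns left_level where x < split, right_level otherwise."""
    return left_level if x < split else right_level

print(stump(0.2, 0.5, -1.0, 1.0))  # -1.0
print(stump(0.7, 0.5, -1.0, 1.0))  # 1.0
```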
InΒ [82]:
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()

# plot the first 4 depth-1 tree (stump) basis elements
demo.show_1d_tree(depth = 1)
  • Split points and levels are set directly using the data, not learned as internal parameters
  • e.g., a split point is placed between two inputs, with the left / right level set to the average of the outputs to the left / right of the split
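This level-setting rule can be sketched as follows, using a tiny hypothetical dataset:

```python
import numpy as np

def stump_levels(x, y, split):
    """Set the levels from data: each is the mean output on its side of the split."""
    return y[x < split].mean(), y[x >= split].mean()

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 5.0, 7.0])
# candidate split placed halfway between the second and third inputs
left, right = stump_levels(x, y, split=1.5)
print(left, right)  # 1.0 6.0
```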
InΒ [84]:
demo = nonlib.stump_visualizer_2d.Visualizer()
csvname = datapath + 'noisy_sin_raised.csv'
demo.load_data(csvname)
demo.browse_stumps()



InΒ [3]:
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tree',num_elements = [v for v in range(1,10)])



InΒ [49]:
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements =  [v for v in range(1,20)] ,view = [20,110],basis = 'tree')



  • Generalizes to $N$ dimensional input (split points / levels in each individual dimension)
  • General attributes of tree bases:
    • each element is adjustable, with internal parameters pre-set greedily to data
    • leads to convex cost functions for regression / classification, trained via boosting (greedy coordinate descent)
    • great for large sparse datasets

How many elements to choose?ΒΆ

  • The more elements we use the better we fit our data (provided we optimize correctly)
  • This is good in theory, but not in practice
  • In theory - with infinite clean data and infinite computation - the more elements (of any basis) we use, the better the fit / separation
  • In fact, in such an instance all three catalogs are equal, and can perfectly represent / separate
InΒ [5]:
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'sin_function.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,50,1)])



  • With real data, however, more elements $\neq$ a better fit
InΒ [4]:
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,25)])



  • Nothing has 'gone wrong' here - we just do not know ahead of time how many elements are 'too few' and how many are 'too many' (in terms of nonlinearity)
  • i.e., the error on the training set always goes down as we add elements (provided we optimize correctly)
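The observation that training error only decreases as elements are added can be checked numerically - the sketch below uses synthetic data and polynomial units, and assumes exact least squares optimization:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(40)

train_errors = []
for D in range(1, 8):
    F = np.column_stack([x**d for d in range(D + 1)])  # degree-D polynomial features
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    train_errors.append(np.mean((F @ w - y) ** 2))

# with nested models and exact optimization, training error never increases
print(all(b <= a + 1e-10 for a, b in zip(train_errors, train_errors[1:])))  # True
```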
InΒ [7]:
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,25)])



  • When the fit is not nonlinear enough / we use too few basis features --> we call this underfitting
  • When the fit is too nonlinear / we use too many basis features --> we call this overfitting
  • The method for finding the best number of elements to use is called cross-validation

The fundamentals of cross-validationΒΆ

  • A reasonable diagnosis of the overfitting / underfitting problems is that both fail to represent new data - generated via the same process by which the current data was made - that we may receive in the future
  • But of course we do not have access to any "new data we will receive in the future"
  • So we simulate this scenario by splitting our data into two subsets: a larger training set of data we already have, and a smaller testing set of data that we "will receive in the future"
  • Then we try a range of feature counts, fitting each candidate model to the training set and measuring its error on both the training and testing sets
  • We pick the number of elements that gives the lowest error on our 'future data', i.e., on our testing set
  • e.g., with a simple example and polynomials
InΒ [34]:
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='poly',num_elements = [v for v in range(1,10)],folds = 3)



  • e.g., with a simple example and tanh units
InΒ [38]:
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='tanh',num_elements = [v for v in range(1,10)],folds = 3)



How to choose training/testing split?ΒΆ

  • Cut the dataset randomly into $K$ non-overlapping folds, using one fold for testing and the remaining $K-1$ for training

  • How many folds? Typically between 3 and 10
  • The smaller / less representative the dataset, the more folds should be used - i.e., the fewer datapoints we can afford to set aside as the test set we pretend to 'receive in the future'
  • The extreme case for very small datasets: 'leave-one-out cross-validation', where the test set is a single point!
  • For small to medium-sized datasets it is worth performing cross-validation several times, picking the winner on average
  • This is called 'K-fold cross-validation': perform the same routine using one fold for testing / $K-1$ for training, cycling through which fold serves as the testing fold
  • This reduces the chance that we pick a non-representative testing set (i.e., that we overfit to the test set)
  • Very computationally intensive!
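The whole K-fold procedure above can be sketched as follows, again with synthetic data and polynomial units (the helper name and dataset are our own, not the library's):

```python
import numpy as np

def cross_val_error(x, y, D, K=3, seed=0):
    """Average test-fold error of a degree-D polynomial fit over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    feats = lambda z: np.column_stack([z**d for d in range(D + 1)])
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w, *_ = np.linalg.lstsq(feats(x[train]), y[train], rcond=None)
        errs.append(np.mean((feats(x[test]) @ w - y[test]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 60)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(60)
# pick the candidate degree with the lowest average test-fold error
best_D = min(range(1, 10), key=lambda D: cross_val_error(x, y, D))
print(best_D)
```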